{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(data-quickstart)=\n",
    "# Data Analysis Quickstart\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "Here we'll do a whistlestop tour of data analysis in Python using a structure called a *dataframe*. Dataframes do everything a spreadsheet does, and a whole lot more. At their simplest, dataframes are a tabular representation of data with rows and columns. The data in each column can be anything; text, numbers, Python objects such as lists or dictionaries, or even other dataframes!\n",
    "\n",
    "The ability to extract, clean, and analyse data is one of the core skills any economist needs. Fortunately, the (open source) tools that are available for data analysis have improved enormously in recent years, and working with them can be a delight——even the most badly formatted data can be beaten into shape.\n",
    "\n",
    "In this chapter, we'll see analysis on a *single* dataframe using the Star Wars' characters dataset as an example. For a more thorough grounding in using data, see the next chapter.\n",
    "\n",
    "This chapter uses the [**pandas**](https://pandas.pydata.org/) and [**numpy**](https://numpy.org/) packages. If you're running this code, you may need to install these packages. The Anaconda distribution of Python comes with **pandas** and **numpy** installed. If you don't have these installed, you can install them by running either `conda install packagename` or `pip install packagename` on your computer's command line. You can find a brief guide to installing packages on {ref}`code-preliminaries`.\n",
    "\n",
    "This chapter is hugely indebted to the fantastic [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/), and both the **pandas** [documentation](https://pandas.pydata.org/docs/user_guide/index.html) and amazing [introductory tutorials](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/index.html).\n",
    "\n",
    "## Loading data and checking datatypes\n",
    "\n",
    "Loading data into a dataframe is achieved with commands like `df = pd.read_csv(...)` or `df = pd.read_stata(...)`. Let's load the Star Wars data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from pathlib import Path\n",
    "\n",
    "# Set seed for reproducibility\n",
    "np.random.seed(10)\n",
    "# Set max rows displayed for readability\n",
    "pd.set_option(\"display.max_rows\", 6)\n",
    "# Plot settings\n",
    "plt.style.use(\n",
    "    \"https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt\"\n",
    ")\n",
    "\n",
    "df = (pd.read_csv(\n",
    "    \"https://github.com/aeturrell/coding-for-economists/raw/main/data/starwars.csv\",\n",
    "    index_col=0,\n",
    "    )\n",
    "    .dropna(subset=[\"species\"])\n",
    "    )\n",
    "# Check info about dataframe\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Look at the first few rows with `head()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{admonition} Exercise\n",
    "What happens if you pass a number to `head()`, eg `head(10)`?\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filter rows and columns with conditions using `df.loc[condition(s) or row(s), column(s)]`\n",
    "\n",
    "`.loc` stands for location and allows you to filter (aka subset) a dataframe. `.loc` works like an index, so it always comes with square brackets, eg `df.loc[...]`.\n",
    "\n",
    "`loc` takes two arguments. The first is a list of the names of the rows that you'd like to select *or* a condition (ie a list of booleans with the same length as the dataframe) that selects certain rows. Remember, you can easily create a series of booleans by checking a column against a condition, for example `df['column1'] == 'black'`.\n",
    "\n",
    "The second argument consists of a list of column names you'd like to select. In both cases, `:` is shorthand for 'use all rows' or 'use all columns'. If you have *either* condition(s) *or* column(s) (but not both), you can simply write `df[condition(s)]` or `df[column(s)]`.\n",
    "\n",
    "Here's an example with a condition built up out of two parts and a list of columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc[(df[\"hair_color\"] == \"brown\") & (df[\"eye_color\"] == \"blue\"), [\"name\", \"species\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{admonition} Exercise\n",
    "Using `loc`, filter the dataframe to `mass` greater than 50 for the `name` and `homeworld` columns\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sort rows or columns with `.sort_values()`\n",
    "\n",
    "Use `sort_values(columns, ascending=False)` for descending order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.sort_values([\"height\", \"mass\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{admonition} Exercise\n",
    "Using `sort_values`, sort the dataframe by the `name` column.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Choose multiple rows or columns using slices\n",
    "\n",
    "Slices can be passed by name using `.loc[startrow:stoprow:step, startcolumn:stopcolumn:step]` or by position using `.iloc[start:stop:step, start:stop:step]`.\n",
    "\n",
    "Choosing every 10th row from the second, and the columns between 'name' and 'gender':"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc[2::10, \"name\":\"gender\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that `loc` only works here with numbers for rows because it just so happens that the names of the rows *are* numbers. If the rows had names that were strings, and we wanted to subset rows by their index position, we would have to use `iloc` instead.\n",
    "\n",
    "Choosing the first 5 rows and the last 2 columns by index position:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.iloc[:5, -2:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{admonition} Exercise\n",
    "Using `.iloc`, display the first 6 rows and last 6 columns.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Randomly selecting a sample using `.sample`\n",
    "\n",
    "`.sample(n)` randomly selects `n` rows, `.sample(frac=0.4)` selects 40% of the data, `replace=True` samples with replacement, and passing `weights=` selects a number or fraction with the probabilities given by the passed weights. (Note that weights passed should have the same length as the dataframe.)\n",
    "\n",
    "Taking a sample of 5 rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.sample(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{admonition} Exercise\n",
    "Use `.sample` to sample 5% of the dataframe.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Rename with `.rename`\n",
    "\n",
    "You can rename all columns by passing a function, for instance `df.rename(columns=str.lower)` to put all columns in lower case. Alternatively, use a dictionary to say which columns should be mapped to what:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.rename(columns={\"homeworld\": \"home_world\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Add new columns with `.assign` or assignment\n",
    "\n",
    "Very often you will want to create new columns based on existing columns.\n",
    "\n",
    "![](https://pandas.pydata.org/docs/_images/05_newcolumn_1.svg)\n",
    "\n",
    "There are two ways to do this. Let's see them both with an example where we'd like to create a new column of height in metres, called `\"height_m:`.\n",
    "\n",
    "- The first, and most commonly used, is called *assignment* and involves just entering the new column name within your dataframe and putting it on the left-hand side of an assignment expression that has an operation based on existing dataframe columns on the right-hand side. For example, `df['height_m'] = df['height']/100`.\n",
    "- The second is to use the `assign` method on a dataframe directly. In this case, the assignment statement appears inside the brackets but you don't need to write as much text because it's clear from the context that, on the left-hand side of the assignment, we're talking about the given dataframe. An example is `df.assign(height_m=df[\"height\"] / 100)`.\n",
    "\n",
    "Let's see working examples of both of these assignment methods."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First let's use the assignment approach:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['height_m'] = df['height']/100\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now with the `.assign` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.assign(height_m=df[\"height\"] / 100)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This was added to the end; ideally, we'd like it next to the height column, which we can achieve by sorting the columns (`axis=1`) alphabetically:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(df.assign(height_m=df[\"height\"] / 100).sort_index(axis=1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To overwrite existing columns just use `height = df['height']/100` with the `assign` method or `df['height'] = df['height']/100` with an assignment expression.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{admonition} Exercise\n",
    "Add a new column that gives the ratio of mass to height.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summarise numerical values with `.describe()`\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Group variables values with `.groupby()`\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(\"species\")[[\"height\", \"mass\"]].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{admonition} Exercise\n",
    "Find the standard deviation (using `std()`) of masses by `homeworld`.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Add transformed columns using `.transform()`\n",
    "\n",
    "Quite often, it's useful to put a column into a dataframe that is the result of an intermediate groupby and aggregation. For example, subtracting the group mean or normalisation. Transform does this and *returns a transformed column with the same shape as the original dataframe*. Transform preserves the original index. (There are other methods, such as apply, that return a new dataframe with the groupby variables as a new index.)\n",
    "\n",
    "Below is an example of transform being used to demean a variable according to the mean by species. Note that we are using lambda functions here. Lambda functions are a quick way of writing functions without needing to give them a name, e.g. `lambda x: x+1` defines a function that adds one to x. In the example below, the `x` in the lambda function takes on the role of mass grouped by species."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"mass_demean_species\"] = df.groupby(\"species\")[\"mass\"].transform(lambda x: x - x.mean())\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{admonition} Exercise\n",
    "Create a `height_demean_homeworld` column that gives the height column with the mean height by homeworld subtracted.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Make quick charts with `.plot.*`\n",
    "\n",
    "Including scatter, area, bar, box, density, hexbin, histogram, kde, and line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.plot.scatter(\"mass\", \"height\", alpha=0.5);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.plot.box(\"species\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"height\"].plot.kde(bw_method=0.3);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export results and descriptive statistics\n",
    "\n",
    "You'll often want to export your results to a latex file for inclusion in a paper, presentation, or poster. Let's say we had some descriptive statistics on a dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "table = df[[\"mass\", \"height\"]].agg([np.mean, np.std])\n",
    "table"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can export this to a range of formats, including string, html, xml, markdown, the *clipboard* (so you can paste it), Excel, and more. In your favourite IDE with a Python language server (eg Code, JupyterLab) start typing `table.to` and a list of possible methods beginning `to` should appear, including `to_string()`.\n",
    "\n",
    "Here is an example of exporting your **pandas** table to a LaTeX table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(table.to_latex(caption=\"A Table\", label=\"tab:descriptive\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Writing to the terminal isn't that useful for getting your paper or report done! To export to a file, use `table.to_latex('file.tex', ...)`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{admonition} Exercise\n",
    "Try exporting the table above using the `to_string(\"table.txt\")` method.\n",
    "\n",
    "If you are running this locally, the file should appear in the directory in which you are running this notebook.\n",
    "\n",
    "If you are using Google Colab to do these exercises, you can check that the file exported by running `!ls` in a new code cell to see all files in the current notebook directory. To get the contents of the file you created, run `!cat table.txt`.\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "interpreter": {
   "hash": "c4570b151692b3082981c89d172815ada9960dee4eb0bedb37dc10c95601d3bd"
  },
  "kernelspec": {
   "display_name": "Python 3.8.12 64-bit ('codeforecon': conda)",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
